
    Towards using web-crawled data for domain adaptation in statistical machine translation

    This paper reports on ongoing work focused on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and for exploiting them for testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on the domains of Natural Environment and Labour Legislation and two language pairs: English–French and English–Greek.
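A standard way to turn general web-crawled text into domain-specific training data is cross-entropy-difference selection (Moore–Lewis style): keep sentences that score better under an in-domain language model than under a general one. The sketch below uses smoothed unigram models only, as a minimal illustration; the corpora and threshold are invented for the example and are not from the paper.

```python
import math
from collections import Counter

def unigram_logprob(tokens, counts, total, vocab, alpha=1.0):
    """Add-alpha smoothed unigram log-probability, averaged per token."""
    return sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab))
        for t in tokens
    ) / len(tokens)

def select_in_domain(candidates, in_domain, general, threshold=0.0):
    """Moore-Lewis style selection: keep sentences that are more probable
    under the in-domain model than under the general-domain model."""
    def stats(corpus):
        c = Counter(t for s in corpus for t in s.split())
        return c, sum(c.values())
    in_c, in_n = stats(in_domain)
    gen_c, gen_n = stats(general)
    vocab = len(set(in_c) | set(gen_c))  # shared vocabulary size for smoothing
    kept = []
    for sent in candidates:
        toks = sent.split()
        score = (unigram_logprob(toks, in_c, in_n, vocab)
                 - unigram_logprob(toks, gen_c, gen_n, vocab))
        if score > threshold:
            kept.append(sent)
    return kept

# Toy corpora (illustrative only).
env = ["water pollution harms rivers", "forest habitats need protection"]
gen = ["the meeting starts at noon", "please send the report today"]
crawl = ["rivers suffer from water pollution", "send the agenda before noon"]
print(select_in_domain(crawl, env, gen))  # keeps only the environment sentence
```

Real systems use n-gram or neural language models in place of the unigram scorer, but the selection criterion is the same.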

    Web crawling and domain adaptation methods for building English–Greek machine translation systems for the culture/tourism domain

    Technical report on the work carried out by Víctor Manuel Sánchez Cartagena during a research stay at the Athena Research and Innovation Center, while employed by Prompsit Language Engineering and serving as an honorary collaborator at the Departamento de Lenguajes y Sistemas Informáticos of the Universidad de Alicante. This paper describes the process we followed in order to build English–Greek machine translation systems for the tourism/culture domain. We experimented with different data sets and domain adaptation methods for statistical machine translation and also built neural machine translation systems. The in-domain data were obtained by means of the ILSP Focused Crawler. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

    Towards a Frame Semantics Lexical Resource for Greek

    Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 55-59. © 2007 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/4476.

    GREEK-BERT: The Greeks visiting Sesame Street

    Transformer-based language models, such as BERT and its variants, have achieved state-of-the-art performance in several downstream natural language processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQuAD, RACE). However, these models have mostly been applied to the resource-rich English language. In this paper, we present GREEK-BERT, a monolingual BERT-based language model for modern Greek. We evaluate its performance in three NLP tasks, i.e., part-of-speech tagging, named entity recognition, and natural language inference, obtaining state-of-the-art performance. Interestingly, in two of the benchmarks GREEK-BERT outperforms two multilingual Transformer-based models (M-BERT, XLM-R), as well as shallower neural baselines operating on pre-trained word embeddings, by a large margin (5%-10%). Most importantly, we make both GREEK-BERT and our training code publicly available, along with code illustrating how GREEK-BERT can be fine-tuned for downstream NLP tasks. We expect these resources to boost NLP research and applications for modern Greek. Comment: 8 pages, 1 figure; 11th Hellenic Conference on Artificial Intelligence (SETN 2020).
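BERT-family models such as GREEK-BERT segment input words with a WordPiece-style subword vocabulary before feeding them to the Transformer. The toy sketch below illustrates the greedy longest-match-first segmentation such tokenizers use; the tiny Greek vocabulary here is invented for illustration and is not GREEK-BERT's actual vocabulary, which ships with the released model.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword segmentation, in the style of
    BERT's WordPiece tokenizer. Continuation pieces carry a '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # mark as a word-internal piece
            if cand in vocab:
                piece = cand
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if piece is None:
            return ["[UNK]"]  # whole word is unsegmentable
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (illustrative only).
vocab = {"γλωσσ", "##α", "##ικος", "μοντελ", "##ο"}
print(wordpiece_tokenize("γλωσσα", vocab))   # ['γλωσσ', '##α']
print(wordpiece_tokenize("μοντελο", vocab))  # ['μοντελ', '##ο']
```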

    Third version (v4) of the integrated platform and documentation

    The deliverable describes the third and final version of the PANACEA platform.

    D3.1. Architecture and design of the platform

    This document aims to establish the requirements, the technological basis, and the design of the PANACEA platform. The main goals of the document are to:
    - Survey the different technological approaches that can be used in PANACEA.
    - Specify guidelines for the metadata.
    - Establish the requirements for the platform.
    - Make a Common Interface proposal for the tools.
    - Propose a format for the data to be exchanged by the tools (Travelling Object).
    - Choose the technologies that will be used to develop the platform.
    - Propose a work plan.

    D4.1. Technologies and tools for corpus creation, normalization and annotation

    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC) consisting of NLP tools, including modules for sentence splitting, POS tagging, lemmatization, parsing, and named entity recognition.
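A subsystem of this shape is naturally expressed as a pipeline: crawled pages flow through cleanup/normalization and then through text-processing stages. The sketch below chains a minimal cleanup step with a naive sentence splitter; the function names and the regex-based stages are illustrative stand-ins, not the actual CNC/TPC implementations.

```python
import re

def cleanup(text):
    """CNC-style normalization: drop leftover markup, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip residual HTML tags
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

def split_sentences(text):
    """Naive TPC-style sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def pipeline(raw_pages):
    """Chain acquired pages through cleanup and sentence splitting."""
    sentences = []
    for page in raw_pages:
        sentences.extend(split_sentences(cleanup(page)))
    return sentences

pages = ["<p>Water  quality matters.</p> <p>Rivers need protection!</p>"]
print(pipeline(pages))
# ['Water quality matters.', 'Rivers need protection!']
```

Real pipelines replace each stage with a proper tool (boilerplate removal, trained sentence splitters, taggers, parsers), but the stage-by-stage composition is the same.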

    D6.1: Technologies and Tools for Lexical Acquisition

    This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated into the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are covered: Subcategorization Frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs) for both nouns and verbs, and Multi-Word Expressions (MWEs).
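Of the four areas, MWE acquisition has a particularly compact classic baseline: rank adjacent word pairs by an association measure such as pointwise mutual information (PMI), so that pairs which co-occur more often than chance surface as candidates. The sketch below is that generic baseline on a toy corpus, not the specific tools described in the deliverable.

```python
import math
from collections import Counter

def bigram_pmi(sentences, min_count=2):
    """Rank candidate multi-word expressions by pointwise mutual
    information over adjacent word pairs (a standard MWE heuristic)."""
    uni, bi = Counter(), Counter()
    total = 0
    for s in sentences:
        toks = s.lower().split()
        uni.update(toks)
        total += len(toks)
        bi.update(zip(toks, toks[1:]))
    scores = {}
    for (a, b), c in bi.items():
        if c >= min_count:  # ignore rare pairs: PMI is unstable for them
            pmi = math.log((c / total) / ((uni[a] / total) * (uni[b] / total)))
            scores[(a, b)] = pmi
    return sorted(scores, key=scores.get, reverse=True)

corpus = [
    "the machine translation system improved",
    "machine translation needs parallel data",
    "the system needs the data",
]
print(bigram_pmi(corpus))  # [('machine', 'translation')]
```

Production systems add POS filters, longer n-grams, and better association measures (log-likelihood ratio, t-score), but PMI ranking is the usual starting point.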

    Adquisición automática de recursos para traducción automática en el proyecto Abu-MaTran

    This paper provides an overview of the research and development activities carried out to alleviate the language-resources bottleneck in machine translation within the Abu-MaTran project. We have developed a range of tools for the acquisition of the main resources required by the two most popular approaches to machine translation, i.e. statistical models (corpora) and rule-based models (dictionaries and rules). All these tools have been released under open-source licenses and have been developed with the aim of being useful for industrial exploitation. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).